Load data.
traindf <- fread('statistical-learning-hw03-2022/train_hw03.csv')
testdf <- fread('statistical-learning-hw03-2022/test_hw03.csv')
Let’s observe the train dataset and handle the data accordingly.
As stated in the assignment, for every observation we have:
Let's define these variables:
# ROI 116, time slots 115
nroi=116
ntime=115
The procedure now is to create a functional connectome for every observation: as described in the linked paper, a connectome is a network that describes the neural connections among ROIs, obtained by computing a statistical association measure between the ROIs' time series.
To start, we had to organize the dataset differently. We created a list with one dataset per observation: each dataset stacks all the ROIs one over the other as a ROI x TIME matrix.
## CREATE LIST OF DATASETS OF A SINGLE OBSERVATION
train_OBS_l = list()
for (i in 1:nrow(traindf)){ #scan every observation
  #each time, create empty matrix of nroi x ntime
  df_obs <- matrix(NA, nrow = nroi, ncol = ntime)
  for (j in 1:nroi){
    ##fill the matrix with slices of the obs
    df_obs[j,] <- as.numeric(traindf[i, (5+(j-1)*ntime):(5+(ntime-1)+(j-1)*ntime)])
  }
  #save in the list of datasets
  train_OBS_l <- append(train_OBS_l, list(df_obs))
}
remove(df_obs)
## Now we have every observation in order in a df list
train_OBS_l[[1]][1:10,1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] -0.458933 -0.471744 -0.461779 -0.305860 0.047765 0.356486 0.271883
## [2,] -0.199841 -0.620084 -0.727891 -0.379165 0.110115 0.338624 0.264586
## [3,] 0.811157 0.252584 -0.688949 -1.414260 -1.417864 -0.804392 -0.207096
## [4,] 0.760116 -0.002081 -1.185835 -1.955906 -1.746215 -0.808000 0.044265
## [5,] 0.551511 1.660368 1.987118 0.944090 -0.804357 -1.857726 -1.545568
## [6,] 0.774092 1.039387 0.771135 -0.333226 -1.727714 -2.301783 -1.537867
## [7,] 0.364723 0.303796 -0.058658 -0.523826 -0.665585 -0.338497 0.074766
## [8,] 0.716399 0.136022 -0.904792 -1.997585 -2.557935 -2.254249 -1.365761
## [9,] 0.029625 -1.099732 -1.652020 -1.620716 -1.318615 -0.815098 -0.002540
## [10,] 0.333594 -0.158951 -1.133092 -1.917007 -1.506452 0.304900 2.333011
## [,8] [,9] [,10]
## [1,] -0.198189 -0.550521 -0.257471
## [2,] 0.171200 0.241526 0.338452
## [3,] -0.119524 -0.365243 -0.341567
## [4,] 0.215255 -0.177573 -0.545698
## [5,] -0.736431 -0.901073 -2.320650
## [6,] -0.234377 0.254925 -0.435582
## [7,] 0.062514 -0.396219 -0.764966
## [8,] -0.607326 -0.444168 -0.576089
## [9,] 0.869891 1.151686 0.540851
## [10,] 2.850916 1.222012 -1.361021
Visual Representation.
This is the heatmap of one of the datasets we created: observation 1 is displayed.
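The plotting code was omitted here; a minimal base-R sketch (assuming `train_OBS_l` from the chunk above; the `image()` arguments are our choice, not the original code) could be:

```r
## Heatmap sketch for observation 1: the ROI x TIME matrix built above.
obs1 <- train_OBS_l[[1]]
image(x = 1:ncol(obs1), y = 1:nrow(obs1), z = t(obs1),
      xlab = "Time slot", ylab = "ROI",
      main = "Observation 1: ROI time series")
```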
Other statistics and visualizations of Observation 1.
For every observation dataset, we take the correlation between the ROIs.
##first tryout with obs 1
OBS=1
## cor() works column-wise, so the matrix must be transposed first
# if we want correlations between the rows (ROIs)
#Absolute values
cor_mat1 <- abs(cor(t(train_OBS_l[[OBS]])))
### CONNECTOME DATASET
## DEFINE FUNCTION TO CREATE CONNECTOME
Create_Connect_abs_df <- function(df_list, nroi1){
  OBS = length(df_list)
  #Empty matrix to fill, dimension reduced
  Connect_diag_mat1 <- matrix(NA, nrow = OBS, ncol = nroi1*(nroi1-1)/2)
  for (i in 1:OBS){
    #1. scan all observations
    #2. make corr matrix
    cor_mat_i <- cor(t(df_list[[i]]))
    ## 2.1 HANDLE THE NAS!
    cor_mat_i[is.na(cor_mat_i)] <- 0
    #3. take only the upper triangular matrix without the diagonal, and flatten
    cor_arr_i <- cor_mat_i[upper.tri(cor_mat_i, diag = F)]
    remove(cor_mat_i)
    #4. save it in the new matrix
    Connect_diag_mat1[i,] <- abs(cor_arr_i)
  }
  return(Connect_diag_mat1)
}
Connect_diag_mat_abs_ <- Create_Connect_abs_df(train_OBS_l,nroi)
Connect_diag_mat_abs2 <- data.frame(Connect_diag_mat_abs_,
age=traindf$age,sex=as.numeric(((traindf$sex=='male')*1)),y=as.factor(traindf$y))
nrow(Connect_diag_mat_abs2)
## [1] 600
ncol(Connect_diag_mat_abs2)
## [1] 6673
The final dataset has 600 rows, the same observations as at the start, and 6673 columns: the 6670 flattened connectome correlations between ROIs, plus age, sex, and the response y.
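Each of the 6670 correlation columns corresponds to one ROI pair. As a sanity check (this helper is ours, not part of the original pipeline), a flattened upper-triangle index can be mapped back to its ROI pair:

```r
## Map a flattened upper-triangle index back to its ROI pair.
## upper.tri() flattens column by column: (1,2), (1,3), (2,3), (1,4), ...
flat_to_pair <- function(idx, nroi){
  pairs <- which(upper.tri(diag(nroi)), arr.ind = TRUE)
  pairs[idx, ]  # (row, col) = the two ROIs of this correlation
}
flat_to_pair(1, 116)    # ROIs 1 and 2
flat_to_pair(6670, 116) # ROIs 115 and 116
```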
N_features=50
#KPCA
dt<-as.matrix(Connect_diag_mat_abs2[,-6673])
rbf<-rbfdot(sigma=0.01)
km<-kernelMatrix(rbf,dt)
kpc <- kpca(km,data=Connect_diag_mat_abs2[,-6673],kernel="rbfdot",kpar=list(sigma=0.2),features=N_features)
#KERNEL COMPONENTS as new dataset
kern_comp <- pcv(kpc)
kern_comp2 <- data.frame(kern_comp, y=as.factor(traindf$y))
The number of features (50) was treated as a hyperparameter and cross-validated to find the number of components giving the best accuracy.
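The selection code is not reported; a sketch of the kind of loop we mean (reusing `km`, `Connect_diag_mat_abs2` and `traindf` from the chunks above; the candidate values are illustrative and the loop is not executed here) would be:

```r
## Sketch: treat the number of kernel components as a hyperparameter.
for (nf in c(10, 25, 50, 100)){
  kpc_nf <- kpca(km, data = Connect_diag_mat_abs2[,-6673],
                 kernel = "rbfdot", kpar = list(sigma = 0.2), features = nf)
  dat_nf <- data.frame(pcv(kpc_nf), y = as.factor(traindf$y))
  fit_nf <- tune.svm(x = dat_nf[, -(nf+1)], y = dat_nf[, (nf+1)],
                     type = "C-classification", kernel = "radial",
                     cost = 100, gamma = 0.0001, cross = 10)
  cat(nf, "features -> CV accuracy:", 1 - fit_nf$best.performance, "\n")
}
```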
Since tuning is very time-consuming, the tuning code is reported below but not executed in this R Markdown. The parameters tuned were the kernel type, the cost, and the gamma. We run the best configuration revealed by that process, with parameters:
## tuning
#tune_out <-
# tune.svm(x = kern_comp2[, -51], y = kern_comp2[, 51],
# type = "C-classification",
# kernel = c('polynomial','radial'),
# cost = 10^(1:5),
# gamma = c(0.1,0.01,0.001,0.0001),
# cross=10, nrepeat=15)
#
#
#summary(tune_out)
#1 - tune_out$best.performance
set.seed(1)
tune_out <-
tune.svm(x = kern_comp2[, -(N_features+1)], y = kern_comp2[, (N_features+1)],
type = "C-classification",
kernel = "radial", cost = 100,
gamma = 0.0001, cross=10, nrepeat=15)
summary(tune_out)
##
## Error estimation of 'svm' using 10-fold cross validation: 0.3783333
1 - tune_out$best.performance
## [1] 0.6216667
This model performs with a stable cross-validated accuracy above 0.62, much better than any other model and preprocessing of the data that we tried (reported later).
In order to predict the test labels, we now have to preprocess the test data in the same way. We first generate a connectome dataset:
#Generate Connectome test dataset
#1. CREATE LIST OF DATAFRAMES
test_OBS_l = list()
for (i in 1:nrow(testdf)){ #scan every observation
  if (i/nrow(testdf) == 0.5) print('halfway')
  #each time, create empty matrix of nroi x ntime
  df_obs <- matrix(NA, nrow = nroi, ncol = ntime)
  for (j in 1:nroi){
    ##fill the matrix with slices of the obs
    ##(the test set has no y column, so the column offset is 4 instead of 5)
    df_obs[j,] <- as.numeric(testdf[i, (4+(j-1)*ntime):(4+(ntime-1)+(j-1)*ntime)])
  }
  #save in the list of datasets
  test_OBS_l <- append(test_OBS_l, list(df_obs))
}
remove(df_obs)
## Now we have every observation in order in a df list
#test_OBS_l[[1]]
#2. CREATE CONNECTOME DATASET
Connect_test <- Create_Connect_abs_df(test_OBS_l,nroi)
nrow(Connect_test)
## [1] 199
ncol(Connect_test)
## [1] 6670
Data handling for the KPCA -> SVM pipeline with train and test:
#train new dataframe
train_new <- data.frame(Connect_diag_mat_abs_, age=traindf$age, sex= ((traindf$sex=='male')*1))
#test new dataframe
test_new <- data.frame(Connect_test, age=testdf$age, sex = ((testdf$sex=='male')*1))
#binding together the data to perform kpca
total_new <- rbind(train_new, test_new)
################# KPCA -> SVM PIPELINE
dt2<-as.matrix(total_new)
rbf2<-rbfdot(sigma=0.01)
km2<-kernelMatrix(rbf2,dt2)
#choose a number of features to extract
N_feat=50
kpc2 <- kpca(km2,data=total_new,kernel="rbfdot",kpar=list(sigma=0.2),features=N_feat)
# CREATE DATAFRAMES FROM KERNEL COMPONENTS
kern_comp_test <- pcv(kpc2)
kern_comp_test2 <- data.frame(kern_comp_test)
#SPLIT IN TRAIN AND TEST
train_kern_kaggle <- data.frame(kern_comp_test2[1:600,], y=as.factor(traindf$y))
test_kern_kaggle <- kern_comp_test2[601:799,]
## SVM with TUNED PARAMETERS Gamma 0.0001, cost 100, N_features 50.
svmfit_kern_6 = svm(y ~ ., data = train_kern_kaggle, kernel = "radial", gamma=10^(-4), cost = 100, scale = TRUE)
#prediction
pred_kern_test6 <- predict(svmfit_kern_6,test_kern_kaggle)
df <- data.frame(testdf$id,pred_kern_test6)
colnames(df) <- c('id','target')
head(df)
## id target
## 601 2 autism
## 602 3 autism
## 603 4 control
## 604 5 autism
## 605 6 control
## 606 8 autism
#write.csv(df,file = 'Prediction_kern6.csv',row.names = F )
After building the KPCA-reduced connectome dataset, we also tried a Random Forest model.
#Tuning process
#control <- trainControl(method='repeatedcv',
# number=10,
# repeats=3,
# search = 'random')
#Random generate 15 mtry values with tuneLength = 15
#set.seed(1)
#rf_random <- train(y ~ .,
# data = kern_comp2,
# method = 'rf',
# metric = 'Accuracy',
# tuneLength = 15,
# trControl = control)
#print(rf_random)
#Simple Random Forest model tuned with: mtry = 5, nodesize = 3, sampsize = 200
set.seed(1)
rf.model <- randomForest(formula = y ~ ., mtry = 5,
ntree=5000,
nodesize = 3,
sampsize = 200,
data = kern_comp2, importance=T, cross=10)
rf.model
##
## Call:
## randomForest(formula = y ~ ., data = kern_comp2, mtry = 5, ntree = 5000, nodesize = 3, sampsize = 200, importance = T, cross = 10)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 39.67%
## Confusion matrix:
## autism control class.error
## autism 109 167 0.6050725
## control 71 253 0.2191358
After tuning the Random Forest, the best accuracy we can get is still around 0.60-0.61, lower than the SVM's.
Instead of taking the simple Pearson correlation between ROIs, another strategy is to use a time-delayed association measure such as cross-correlation. In this way, if two ROIs correlate poorly at zero lag but behave similarly after a small delay, the cross-correlation still assigns them a larger value than the plain correlation does.
##Create function that perform cross correlation
r_jk <- function(xj,xk,d){
  d = abs(d) # d is the delay parameter
  N = length(xj)
  xj_s <- xj[d:N] #the two arrays are cut according to the delay
  xk_s <- xk[1:(N-d+1)]
  xjkprod <- xj_s*xk_s
  rjk_d <- (1/N)*(sum(xjkprod)) #cross-correlation
  return(abs(rjk_d))
}
#Create list of cross-correlated values varying d
List_maker <- function(xk,xj){
  N = length(xj)
  list_r <- c()
  for (d in 1:floor(N/4)){
    list_r <- append(list_r, r_jk(xj,xk,d))
  }
  return(list_r)
}
#The strength of the connection between two ROIs is defined as
# Strength = 1/d, with d the smallest delay at which the cross-correlation peaks.
Strength_maker <- function(xj,xk){
  list_rjk <- List_maker(xj,xk)
  list_rkj <- List_maker(xk,xj)
  idxmaxjk <- which(list_rjk == max(list_rjk))
  idxmaxkj <- which(list_rkj == max(list_rkj))
  Strength <- 1/(min(idxmaxjk, idxmaxkj))
  return(Strength)
}
Create_Delayed_flat_corr <- function(df){
  nrows <- nrow(df)
  Delayed_corr <- matrix(NA, nrow = nrows, ncol = nrows)
  for (i in 1:nrows){
    xj <- df[i,]
    for (j in 1:nrows){
      xk <- df[j,]
      if (i < j){
        Delayed_corr[i,j] <- Strength_maker(xj,xk)
      }
    }
  }
  Flat_delayed <- Delayed_corr[upper.tri(Delayed_corr, diag = F)]
  return(Flat_delayed)
}
Obs1_Cross_Cor <- Create_Delayed_flat_corr(train_OBS_l[[1]])
head(Obs1_Cross_Cor)
## [1] 1.0000000 0.1428571 0.5000000 0.5000000 0.1250000 1.0000000
The delayed-correlation connectome for the whole dataset can be built with the function below, but it is not executed here for time reasons.
Create_Connect_delay <- function(df_list, nroi1){
  OBS = length(df_list)
  #Empty matrix to fill, dimension reduced
  Connect_diag_mat1 <- matrix(NA, nrow = OBS, ncol = nroi1*(nroi1-1)/2)
  for (i in 1:OBS){
    #print(i)
    Connect_diag_mat1[i,] <- Create_Delayed_flat_corr(df_list[[i]]) #use the argument, not the global list
  }
  return(Connect_diag_mat1)
}
#Connect_delay_df <- Create_Connect_delay(train_OBS_l,nroi) ## 15 Mins!
We use as our 10 variables the first 10 KPCA components of our connectome dataset: the goal is to see whether there are differences in the variable importance of these features. Since a KPCA transformation does not maximize variance in the first components the way PCA does (which makes the first components more "important"), but instead spreads the variance across all components, we do not expect large differences in feature importance, though some may still appear.
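This can be checked directly: kernlab exposes the eigenvalues of a fitted `kpca` object through its `eig()` accessor, so (using `kpc2` from the pipeline above; this check is ours, not part of the original analysis) a quick look at how the variance is spread is:

```r
## Eigenvalues of the kernel PCA: the variance should be spread
## across components rather than concentrated in the first few.
ev <- eig(kpc2)
round(ev / sum(ev), 3)  # proportion of variance per kernel component
barplot(ev / sum(ev), names.arg = seq_along(ev),
        xlab = "Kernel component", ylab = "Proportion of variance")
```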
##rename for PART B LOCO
train <- data.frame(train_kern_kaggle[,1:10], y=train_kern_kaggle$y)
#ncol(train)
train_num <- train
#Create second train for part B LOCO
train_num$y <- as.numeric(train$y=='autism')
head(train_num)
## X1 X2 X3 X4 X5 X6
## 1 0.13286807 0.55131373 0.1833399 -0.007258511 -0.7356222 0.4293599
## 2 -0.09879911 0.02348734 -0.2451273 -0.130780700 -1.2525655 -0.1968014
## 3 0.12307931 0.46510625 -0.2897051 0.392714048 -0.4528480 -1.2880354
## 4 0.08279303 0.16586724 -0.5705679 0.082173454 -0.7231745 0.1823465
## 5 0.06960691 0.64648680 0.2294291 -0.360663597 0.1583193 0.6889257
## 6 -0.09232986 0.21589996 -0.3038602 -0.611958225 0.6165181 -0.8624497
## X7 X8 X9 X10 y
## 1 0.9941348 0.51640482 -1.51115222 0.33822127 1
## 2 -0.4919303 -1.12933693 -0.05530204 0.18003017 1
## 3 -0.5192125 1.62163267 0.82880503 -1.66081437 1
## 4 -0.6152335 -0.06812738 0.28924918 1.28934189 0
## 5 0.9268265 -0.23072830 -0.73551103 -0.09882619 0
## 6 3.1020376 0.23893820 0.14242012 0.30512148 1
Let's define a function that performs a Leave-One-Covariate-Out (LOCO) analysis for every variable in the dataset. Its parameters are the dataframe, the split Proportion, and the confidence level alpha.
#LOCO Function
Create_LOCO_num <- function(df, Proportion, alpha=0.9){
  #How many columns do we need to LOCO? (Ncols - 1)
  Ncols <- ncol(df)
  #1. Split the df randomly in two datasets, D1 and D2.
  #   Proportion controls how much of the dataset goes to the D2 "test" dataset
  n1 <- floor(nrow(df)/Proportion)
  idx <- sample(nrow(df), n1, replace = FALSE)
  D1 <- df[-idx,]
  D2 <- df[idx,]
  #2. Compute the estimate f_n1_hat
  svmfit_LOCO = svm(y ~ ., data = D1, kernel = "radial", gamma=0.001, cost = 100, scale = TRUE)
  #predict on D2
  f_n1_hat <- predict(svmfit_LOCO, D2[-(Ncols)])
  ## Initialize vectors that store the theta point estimate and the upper and lower CI.
  Delta_j_list <- c()
  Lower <- c()
  Upper <- c()
  ## Iterate through all the variables in the df
  Variables <- seq(1, (Ncols-1), 1)
  for (Var_loco in Variables){
    #print(Var_loco)
    #3. Select a variable j and compute the estimate f_n1_wj_hat
    seq_cols <- seq(1, Ncols, 1)
    #columns without the j-th variable
    seq_cols_w = seq_cols[! seq_cols %in% Var_loco]
    #Create partial datasets without variable j
    D1_wj <- D1[, seq_cols_w]
    D2_wj <- D2[, seq_cols_w]
    #new datasets D1_wj and D2_wj without variable j
    #Fit without j
    svmfit_LOCO_wj = svm(y ~ ., data = D1_wj, kernel = "radial", gamma=0.001, cost = 100, scale = TRUE)
    #4. Predict f_n1_wj_hat on the D2 dataset
    f_n1_wj_hat <- predict(svmfit_LOCO_wj, D2_wj[-(Ncols-1)])
    #Then calculate the difference between the losses: we use absolute error as in the linked paper of Lei et al.
    real_y <- D2$y
    f_hat <- f_n1_hat
    f_wj_hat <- f_n1_wj_hat
    ##List of values
    delta_j <- abs(real_y - f_wj_hat) - abs(real_y - f_hat)
    #Use this list for the CI.
    ###Bootstrap method for the confidence interval
    npbt_method2 <- np.boot(delta_j, R=50000, statistic = median, level = alpha)
    ##Save the median point estimate and the lower and upper CI.
    Delta_j_list <- append(Delta_j_list, npbt_method2$t0)
    Lower <- append(Lower, as.numeric(npbt_method2$percent[1]))
    Upper <- append(Upper, as.numeric(npbt_method2$percent[2]))
  }
  Df_CI <- data.frame(Delta_j_list, Lower, Upper)
  #Return the dataframe with the info
  return(Df_CI)
}
The LOCO function is executed with Proportion = 3 and a 0.90 confidence level.
set.seed(12)
Df_CI_num <- round(Create_LOCO_num(train_num,Proportion=3,alpha=0.90),5)
data.frame(Variables = colnames(train_num)[1:10],Df_CI_num)
## Variables Delta_j_list Lower Upper
## 1 X1 0.00451 -0.00125 0.01022
## 2 X2 0.00274 -0.00986 0.01003
## 3 X3 0.00132 -0.00464 0.00712
## 4 X4 0.01221 -0.00527 0.02732
## 5 X5 0.01228 -0.00500 0.02835
## 6 X6 0.00341 -0.00529 0.01348
## 7 X7 -0.00051 -0.00228 0.00239
## 8 X8 0.00325 -0.00312 0.01613
## 9 X9 0.00252 0.00010 0.00658
## 10 X10 0.00224 -0.00217 0.00694
We organize the results as shown, with the variables, the point estimates, and the lower and upper bounds of our confidence interval.
No large importance emerges for specific variables, but variables 4 and 5 seem to increase the loss the most, even though they are not statistically significant since their confidence intervals contain 0. Surprisingly, the 9th variable shows a statistically significant increase. We also have to remember that all these results are conditional on the D1 dataset, which is split at random.
Let's compare these values with the variable importance of a simple random forest model.
#set.seed(1)
rf.model <- randomForest(formula = y ~ ., mtry = 4,
ntree=7000,
nodesize = 3,
sampsize = 100,
data = train, importance=T, cross=10)
rf.model
##
## Call:
## randomForest(formula = y ~ ., data = train, mtry = 4, ntree = 7000, nodesize = 3, sampsize = 100, importance = T, cross = 10)
## Type of random forest: classification
## Number of trees: 7000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 46.5%
## Confusion matrix:
## autism control class.error
## autism 114 162 0.5869565
## control 117 207 0.3611111
rf.model$importance
## autism control MeanDecreaseAccuracy MeanDecreaseGini
## X1 -0.0004483489 0.0006731986 0.0001354311 3.903458
## X2 -0.0024280601 0.0011950082 -0.0004839328 4.171066
## X3 -0.0021514899 0.0023710858 0.0002714734 4.212056
## X4 -0.0045732746 0.0022751392 -0.0008876913 4.107452
## X5 0.0087899951 0.0079116850 0.0083028396 6.014679
## X6 -0.0022303858 -0.0010006117 -0.0015781595 4.053405
## X7 -0.0023099373 0.0007777612 -0.0006454461 4.446385
## X8 0.0033857157 0.0020052538 0.0026293549 4.991211
## X9 0.0014689981 0.0019613679 0.0017284052 4.850191
## X10 0.0003674925 -0.0011916044 -0.0004782161 4.434186
Here the importance of variable 5 is enhanced, while variable 9 retains a significant contribution to prediction. All the other values are quite volatile compared with the LOCO results.
As expected, the LOCO algorithm applied to the KPCA components does not seem to reveal much about feature importance: the KPCA variables work well together, and no single one prevails over the others.
It is also crucial how conditioning on the random D1 dataset influences the results. Here we repeat the LOCO method several times and see how much the reported information changes.
The Proportion rate is also crucial to the variability of the feature importance measure:
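The repetition code is not shown; a sketch of how the intervals can be recomputed over several random splits (reusing `Create_LOCO_num` defined above; the seeds are illustrative) could be:

```r
## Repeat LOCO over different random D1/D2 splits to gauge how much
## the intervals depend on the split; Proportion can be varied the same way.
for (seed in c(1, 2, 3)){
  set.seed(seed)
  ci <- round(Create_LOCO_num(train_num, Proportion = 3, alpha = 0.90), 5)
  print(data.frame(Variable = colnames(train_num)[1:10], ci))
}
```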
The last part of the final homework consists in understanding whether the works of Ovid show stylometric differences across periods. We begin by importing the dataset.
library(data.table)
esa_df <- fread(file='esa_scheme_only.csv')
Before proceeding with the implementation of the chosen methodology, let’s make a first exploration of the data.
summary(esa_df)
ncol(esa_df)
sum(as.numeric(esa_df[1,3:18]))
This gives us a clear view of the compositional data.
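The row-sum check above covers only the first work; the same check over every row (a small sanity check we add here, assuming columns 3:18 hold the whole composition) is:

```r
## Sanity check: each row of the compositional part should share the same total.
totals <- rowSums(as.matrix(esa_df[, 3:18]))
summary(totals)  # all totals should coincide (up to rounding)
```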
Now we finally get to the core of our work. We define the Aitchison distance \(d(X_r, X_s)\):
Aitch_distance <- function(xr, xs){
  xr <- as.numeric(xr)
  xs <- as.numeric(xs)
  xr_geomean <- exp(mean(log(xr)))
  xs_geomean <- exp(mean(log(xs)))
  #xr_geomean <- geometric.mean(xr)
  #xs_geomean <- geometric.mean(xs)
  log_ratio_xr <- log(xr/xr_geomean)
  log_ratio_xs <- log(xs/xs_geomean)
  d <- sqrt(sum((log_ratio_xr - log_ratio_xs)^2, na.rm=TRUE))
  return(d)
}
Aitch_distance(X[1,],X[2,])
Aitch_distance(X[1,],X[17,])
We can observe there is a problem with zeros.
X[17,]
sorted_X <- sort(as.matrix(X))
min_value <- sorted_X[2]
#substitute min_value/2 in X and in the original dataset
X[X==0]<-(min_value/2)
X[17,]
As suggested in the paper, we solve this problem by replacing each 0 with half of the smallest nonzero value. Then we also need to propagate the change to the full dataset "esa_df", not only to X:
esa_df[,3:18] <- X
esa_df[17,]
After that, the distance matrix can be built:
Distance <- c()
for (i in 1:nrow(X)){
  for (j in 1:nrow(X)){
    #distance between every pair of rows
    Distance <- c(Distance, Aitch_distance(X[i,], X[j,]))
  }
}
Now we define the compositional Gaussian Kernel.
#we take h like the median distance to use in Gaussian Kernel function
h_median<-median(Distance)
h_median
comp_gaus_kern <- function(xr, xs, h=h_median){
  dist <- Aitch_distance(xr, xs)
  return(exp(-(1/h)*dist))
}
Finally we arrive at the first point of our work. Our goal is to test the following system of hypotheses:
\[\begin{cases} H0: & There\ are\ no\ stylometric\ differences \\ H1: & There\ are\ stylometric\ differences \end{cases}\]
But first we split the dataset, aggregating the works of the first and second periods.
esa12 <- subset.data.frame(esa_df, period==1 | period==2)
esa3 <- subset.data.frame(esa_df, period==3)
#Create matrix data
X_tot <- esa_df[,3:18]
X12 <- esa12[,3:18]
X3 <- esa3[,3:18]
nrow(X3)
nrow(X12)
We want to implement functions that compute the approximation of Gretton's MMD statistic:
\[ \hat M^2 = {1\over m^2}\sum_{j}\sum_{k}K(X_j,X_k) - {2\over mn}\sum_{j}\sum_{k}K(X_j,Y_k) + {1\over n^2}\sum_{j}\sum_{k}K(Y_j,Y_k)\]
Sum_Kern_function <- function(X, Y){
  #Sums up the kernel values between two datasets
  m <- nrow(X)
  n <- nrow(Y)
  somma <- 0
  for (j in 1:m){
    for (k in 1:n){
      a <- comp_gaus_kern(X[j,], Y[k,])
      somma <- sum(somma, a)
    }
  }
  return(somma)
}
Sum_Kern_function(X3,X3)
A second function is implemented to compute the M statistic:
Generate_M <- function(X, Y){
  m <- nrow(X)
  n <- nrow(Y)
  #compute the three sums:
  SumX <- Sum_Kern_function(X, X)
  SumY <- Sum_Kern_function(Y, Y)
  SumXY <- Sum_Kern_function(X, Y)
  M_hat_2 <- (1/m^2)*SumX - (2/(n*m))*SumXY + (1/n^2)*SumY
  return(sqrt(M_hat_2))
}
M_hat <- Generate_M(X12,X3)
M_hat
## [1] 0.3898305
The dataset is split to perform the permutation test.
#Set aside the observation of the Double Heroides
DH<- esa_df[which(is.na(esa_df$period)),]
# numeric matrix
X_dh <- DH[,3:18]
#Esa dataset without the Double heroides row
esa_df_wd <- esa_df[-which(is.na(esa_df$period)),]
#numeric matrix
X_wd <- esa_df_wd[,3:18]
The period labels are permuted at random and the test statistic M is recalculated.
library(kernlab)
Create_Permut_list <- function(N, X, dim){
  #initialize list
  M_list <- c()
  for (i in 1:N){
    #randomly extract dim row indices from the dataset
    permut <- sample(nrow(X), dim, replace = FALSE)
    #Create 2 datasets split at random
    X_perm1 <- X[permut,]
    X_perm2 <- X[-permut,]
    #Create an M value from these random permutation datasets
    M <- Generate_M(X_perm1, X_perm2)
    #add to the list
    M_list[i] <- M
  }
  return(M_list)
}
This process is repeated N times.
#Lets test if our M_hat value is statistically significant
N=1000
#generate M List
M <- Create_Permut_list(N,X_wd,dim=10)
In the last part of point 1 we evaluate the p-value.
mean(M>=M_hat)
## [1] 0.01
The p-value is 0.01, less than 0.05, so we can reject the null hypothesis and say that there are indeed stylistic differences between period 3 and the other two.
Using the above setup, we proceed to verify whether the observation related to the work Double Heroides (DH) shows stylistic differences compared to the author's other works.
The following system of hypotheses will be studied:
\[\begin{cases} H0: & DH\ has\ no\ stylometric\ differences \\ H1: & DH\ has\ stylometric\ differences \end{cases}\]
N=1000
M_hat_tot <- Generate_M(X_dh,X_wd)
M_hat_tot
M_tot_l <- Create_Permut_list(N,X_tot,1)
mean(M_tot_l>=M_hat_tot)
## [1] 0.969
Based on the p-value we don't have enough evidence to reject H0, so the Double Heroides is consistent with Ovid's body of work.
Now let's check period by period to see which one DH is attributable to.
\[\begin{cases} H0: & DH\ is\ not\ attributable\ to\ the\ current\ period \\ H1: & DH\ is\ attributable\ to\ the\ current\ period \end{cases}\]
The first period:
#extract datasets
esa1 <- subset.data.frame(esa_df, period==1)
X1 <- esa1[,3:18]
#test statistic
M1_hat <- Generate_M(X1,X_dh)
#Add observation DH to dataframe
X1[nrow(X1)+1,] <- X_dh
#Create permutation
M_list_1per <- Create_Permut_list(N,X1,1)
#pvalue
mean(M_list_1per >= M1_hat)
## [1] 0.566
The p-value is too high: we don't have evidence to reject H0, so we can't attribute DH to the first period.
Second period:
#extract datasets
esa2 <- subset.data.frame(esa_df, period==2)
X2 <- esa2[,3:18]
#test statistic
M2_hat <- Generate_M(X2,X_dh)
#Add observation DH to dataframe
X2[nrow(X2)+1,] <- X_dh
#Create permutation
M_list_2per <- Create_Permut_list(N,X2,1)
#pvalue
mean(M_list_2per >= M2_hat)
## [1] 0.67
The same as before. DH is not attributable to the second period.
Third period:
#extract datasets
esa3 <- subset.data.frame(esa_df, period==3)
X3 <- esa3[,3:18]
#test statistic
M3_hat <- Generate_M(X3,X_dh)
#Add observation DH to dataframe
X3[nrow(X3)+1,] <- X_dh
#Create permutation
M_list_3per <- Create_Permut_list(N,X3,1)
#pvalue
mean(M_list_3per >= M3_hat)
## [1] 1
Nothing changes: DH is not attributable to the third period either.
The work Double Heroides does not appear to be related to any period. This result may seem to disagree with point 1: since the three periods show stylistic differences, we expected DH to be attributable to at least one of them. In fact, the stylistic differences noted above do not allow us to date the Double Heroides, whose dating was already uncertain. We can however conclude that our analysis, although it does not resolve the question of dating, at least supports Ovid's authorship of the work.